Creating a Parallel Corpus from the "Book of 2000 Tongues"
1 Abstract

This paper reports on a project to annotate biblical texts in order to create an aligned multilingual Bible corpus for linguistic research, particularly computational linguistics, including automatically creating and evaluating translation lexicons and semantically tagged texts. The output of this project will enable researchers to take advantage of parallel translations across a wider number of languages than previously available, providing, with relatively little effort, a corpus that contains careful translations and reliable alignment at the near-sentence level. We discuss the nature of the text, our annotation process, and intended uses for the corpus, and we point out relevant aspects and potential limitations of the current draft of the Corpus Encoding Standard with respect to this corpus.

2 Why this text?

2.1 The nature of the text

The Bible is a widely available, representative sample of carefully translated texts in a variety of styles in a wide range of languages. These properties uniquely suit our research purposes, which include construction of translation lexicons and evaluation of semantic tagging for multilingual machine translation and other natural language processing applications. The text is a single cohesive document comprising 66 books by 30-40 authors in a variety of text styles. The corpus provides a representative sample of language styles in the source texts, including narrative, poetry, and correspondence.

The New Testament corpus alone "compares favourably in size to other major collections analysed by scholars ... approximately as large as if not larger than the corpus of Homer's Iliad, of Homer's Odyssey, of Sophocles, of Aeschylus, of Herodotus ... [with] individual books ... comparable in size to other well-known classical texts: e.g. Plato's Apology approximates the size of Paul's Romans or 1 Corinthians" (Porter, 1989).
As a resource for research using corpus-based statistical methods in computational linguistics, the Bible is small by current standards (see, e.g., Church and Mercer, 1993); with some variation
Publication date: 1998